Individual Poster Page

DIPS year-to-year correlations, 1972-1992 (August 5, 2003)

Discussion Thread

Posted 10:14 p.m., August 7, 2003 (#64) - Alan Jordan(e-mail)
  Warning retread -

Tango invited me to comment on this thread last night. It took a couple of hours to read through all the posts and check a few of the simulations.

Here is another way of looking at the test-retest correlation in terms of what it's supposed to measure - a player's ability vs. the change in ability - and how the two should be related to the Rsqr. Of course R is simply the square root of Rsqr.

Part 1
Rsqr = Variance of player's ability / (Variance of player's ability + Variance of the change in ability from year to year).

The above assumes that player’s ability and change in ability from year to year are independent (unobservable ability, not observable performance). It seems reasonable at the moment and I can always generalize it if need be.

I can't give you a mathematical proof, but I would start by assuming that the change in ability is the error and ability is the model. I can give you this for those of you who are programmers and have some stats software.

Step 1. Generate a variable called X with a variance of 4 (don't worry about the distribution). In the same step, generate a variable called Err with a variance of 9. Create a variable called Y as the sum of X and Err.

Step 2. Calculate the Rsqr between X and Y and it will be close to .3.
.3=4/(4+9)
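
If you want to try it without a stats package, here's a rough Python/numpy version of the two steps above (the seed and sample size are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n = 100000

x = rng.normal(0, 2, n)    # variance of 4
err = rng.normal(0, 3, n)  # variance of 9
y = x + err

r = np.corrcoef(x, y)[0, 1]
print(r**2)                # close to 4/(4+9) = .3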

Part 2
What does this mean for FJM's denominator hypothesis? An event ratio with a small probability (or a probability far above .5) will have a small variance, because in binary data the probability and the variance are related - the variance of p is p*(1-p). That small variance will cause a small r. Since singles have a higher p, they should have a higher variance and hence a higher r.

Tango - what is the correlation of the logit of these events for each year (the natural log of the odds ratio)? Is that doable for this weekend? If the variance is proportional to the probability of the event, then transforming these data will remove that and give us a better picture. I.e., the r for 1B/PA may be higher than the other r's strictly because it has a higher probability.

Erik Allen -

You said -
The standard deviation is given by
STD = sqrt(n*p*(1-p))

This is wrong. STD = sqrt(P*(1-P)). N doesn't come into play.

The standard deviations are:
1BSTD= 0.4
xbhSTD= 0.3

but your conclusion is correct:

“If you happen to locate a statistic that displays a HIGHER year-to-year correlation, …, then this would seem to imply that the differences in player ability outweigh the variability of the statistic.”

Tango-

Anyway, for these 20 pitchers, here are their year-to-year r
2b: .18, 1b: .47, out: .11
Wouldn't we have expected the out, with the highest numerator, to have the highest r, based on your previous explanation?
Don't read too much into these; you have a sample size of 20. If you don't have a sample size of 20, then you probably did something wrong. My hunch is that even with 1000 pitchers of equal ability, your correlations will be insignificantly different from zero.

Erik Allen-
Corr = sum over i [(x_i-x_avg)*(y_i-y_avg)]
You subtracted the mean, but forgot to divide by the standard deviation. The Corr is a covariance of standardized variables. What you have is a covariance of centered variables.

“In your first simulation, all 20 pitchers should have the same ability. Therefore, if pitcherX were ABOVE average one year, we should not expect him to be ABOVE average the second year, and I would think that corr=0 for a sufficiently large sample.”

I think that’s a good insight.

“range of 0.09 to 0.11. So, on a relative basis, these are the same ranges. The correlation coefficients here are:
1B = 0.46
xbh = 0.28
So, from here we can see that there is significantly less predictability in xbh rate, despite the fact that the relative variation in the statistics is approxiamtely the same.”

I ran your simulation and got about the same numbers you did, but I can't find a transformation that will equalize the r's. I tried logit, ln and square root. Even a nonparametric correlation didn't do the trick. Beats me.

Tango-
You said:

“Will the larger spread in talent among pitchers allow us to get an r to approach 1?”
Absolutely.

FJM-
You said:
“But that assumes every 0.20 pitcher remains a 0.20 pitcher, every 0.18 pitcher stays right there, and so on. How realistic is that? Well, if the range of abilities is very narrow, then the chance of any pitcher greatly improving (or worsening)is very remote. But if the range is very wide, significant changes in year-to-year ability are certainly possible.”
No, the abilities and changes in abilities are assumed to be independent.

“So you can get a small r in either of 2 ways: 1)very small differences in true ability among pitchers with a lot of random variation, or 2)large differences in true ability accompanied by large year-to-year variation in that ability for individual pitchers”
Exactly

Tango-
You said:
“ To recap, the year-to-year r is dependent on:
1 - how many pitchers in the sample
2 - how many PAs per pitcher in year 1
3 - how many PAs per pitcher in year 2
4 - how much spread in the true rates there are among pitchers (expressed probably as a standard deviation)
5 - possibly how close the true rate is to .5
6 - the true rate being the same in year 1 and year 2”
All are true but number 1. The number of pitchers affects the standard error, or the precision of our estimate. With only 20 pitchers, our estimate of r might be too high or too low, but r itself remains unchanged.


Empirical Win Probabilities (August 28, 2003)

Discussion Thread

Posted 12:23 a.m., August 29, 2003 (#4) - Alan Jordan
  I made a little approximation equation using the data provided. I have no theoretical knowledge of how exactly this model should be specified, so I just did a logistic regression. I looked at interactions, and even though they were significant, they didn't add much to the predictive power, so I dropped them. I even dropped the inning variable because it's so correlated with the difference in runs. With this huge sample size it had a p value of only .01, and when I rounded the coefficient off to two decimal places it was .00, so why bother adding it to the model. Anyway, here it is.

WE=exp(LF)/(1+exp(LF))

LF=
0.58 +
HOME *0.5 +
DIFRUNS *0.7 +
OUTS * -0.18 +
SIT2=1 *-0.66 +
SIT2=2 *-0.5 +
SIT2=3 *-0.41 +
SIT2=4 *-0.28 +
SIT2=5 *-0.34 +
SIT2=6 *-0.21 +
SIT2=7 *-0.1

(Sit2 is situation but only goes 1-8, outs have been recoded into another variable).

The area under the ROC curve is .83, which means that if you had the WE from this model you would be right about 83% of the time. Of course it's biased upwards a little. The more complicated models I tried didn't get above .84.

If anybody has any better ideas let me know.


Empirical Win Probabilities (August 28, 2003)

Discussion Thread

Posted 10:28 a.m., August 29, 2003 (#6) - Alan Jordan
  Here's a model for the ninth inning

LF=
1.2898
HOME *0.5231
Diffruns*1.4595
OUTS *-0.3738
SIT2=1 *-1.2722
SIT2=2 *-0.9408
SIT2=3 *-0.7069
SIT2=4 *-0.4566
SIT2=5 *-0.5013
SIT2=6 *-0.2663
SIT2=7 *-0.106

The interaction between sit and difruns adds nothing substantial at all to the predictive power according to this data set and the models. Both models have areas under the curve of .93. Sure the model with the interaction would have a slightly higher area under the curve, but you would have to go to the third decimal point to see it. It's not worth adding six more terms to your model. Remember that this model is essentially multiplicative, not like a linear regression which is additive.

If you did a model like this for each inning the area under the curve would be .84. Maybe you care more about the late innings, I don't know.

Yes Sit2=8 is multiplied by 0.


Empirical Win Probabilities (August 28, 2003)

Discussion Thread

Posted 10:17 p.m., August 29, 2003 (#9) - Alan Jordan
  O.K. Tango, first, what do you mean by within 3 runs?

if abs(diffruns) <= 3 then w3=1; else w3=0;
if diffruns >= 0 and diffruns <= 3 then w3=1; else w3=0;

The first groups games into close or not; the second groups games into (tied or small lead) vs. (big lead or behind by at least one run). Both of these groupings of diffruns cause the predictive ability to drop off steeply.

****************************************

Studes

The basic model is the logistic function

WE=exp(LF)/(1+exp(LF))

where LF is a linear function, i.e. a straight-line equation. exp(LF)/(1+exp(LF)) bends the straight line into an S-shaped curve that can never quite hit 1 or 0. For those of you familiar with odds ratios, the model can also be expressed as:

WE/(1-WE)=exp(LF) or

ln(WE/(1-WE))=LF

The logistic regression is a generalization of the additive linear regression model, but because all the coefficients are actually exponents of e, it's really a multiplicative model.

exp(m+n)=exp(m)*exp(n)

as for the last model I posted which was:

LF=
1.2898
HOME *0.5231
Diffruns*1.4595
OUTS *-0.3738
SIT2=1 *-1.2722
SIT2=2 *-0.9408
SIT2=3 *-0.7069
SIT2=4 *-0.4566
SIT2=5 *-0.5013
SIT2=6 *-0.2663
SIT2=7 *-0.106

Only diffruns and outs are continuous variables. All of the others are dummy variables (0,1). Home is 1 when it's the home team and 0 when it's the visiting team. SIT2=1 is 1 when situation = 1 and 0 otherwise. If you have K groups, then you need K-1 dummy variables. If sit2=1 through sit2=7 are all 0, then logically situation = 8, so there is no reason to create a dummy variable for sit2=8. It actually causes problems with the matrix algebra if you do.

Now for an example. Suppose it's the ninth inning (the model only works for the ninth inning) and the home team has men on 2nd and 3rd with 1 out, and they are behind by 2 runs.

Since it's the home team, home = 1; since they are behind by 2, diffruns = -2; outs = 1; and "sit2=7" = 1 because we have men on 2nd and 3rd. All other sit2 variables must = 0.

1.2898
1 *0.5231
-2*1.4595
1 *-0.3738
0 *-1.2722
0*-0.9408
0 *-0.7069
0 *-0.4566
0 *-0.5013
0 *-0.2663
1 *-0.106

LF=-1.59

and

WE=.17

unless I screwed something up.
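
If anyone wants to check the arithmetic, here's a quick Python sketch of the same calculation (the dictionary layout and function name are just mine, not anything official):

import math

coef = {"intercept": 1.2898, "home": 0.5231, "diffruns": 1.4595, "outs": -0.3738,
        "sit2": {1: -1.2722, 2: -0.9408, 3: -0.7069, 4: -0.4566,
                 5: -0.5013, 6: -0.2663, 7: -0.106, 8: 0.0}}

def win_expectancy(home, diffruns, outs, sit2):
    # LF is the linear function; WE = exp(LF)/(1+exp(LF))
    lf = (coef["intercept"] + coef["home"]*home + coef["diffruns"]*diffruns
          + coef["outs"]*outs + coef["sit2"][sit2])
    return math.exp(lf) / (1 + math.exp(lf))

# home team, down by 2, 1 out, men on 2nd and 3rd (sit2=7)
print(win_expectancy(home=1, diffruns=-2, outs=1, sit2=7))  # about .17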


Empirical Win Probabilities (August 28, 2003)

Discussion Thread

Posted 3:14 a.m., August 31, 2003 (#10) - Alan Jordan
  O.K., I see what you're doing, Tango. Here is an equation that will allow you to compare WEs for your table. I have already compared them. I merged your WEs, mine, and the actuals. I estimated the number of games won for both systems by multiplying the WE by the number of games. That way scenarios with 7,000 games got more weight than those with 50. I then calculated discrepancies as
abs(estWE-ObservedWE) for both systems. Yours had 16,973 discrepancies and mine had 12,064. This is from a base of 156,857 games played, for whatever that tells you.
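
In code form, the comparison is nothing more than this (a sketch - the column names are made up, not from Tango's or Phil's files):

import pandas as pd

# df: one row per scenario with hypothetical columns
# games, we_obs (actual), we_tango and we_mine (the two estimates)
def total_discrepancy(df, col):
    # estimated wins vs. observed wins, weighted by the games in each scenario
    return (df["games"] * (df[col] - df["we_obs"]).abs()).sum()

# total_discrepancy(df, "we_tango"), total_discrepancy(df, "we_mine")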

Here is the model that only works for the 7th, 8th and 9th innings.

LF= 1.0298 +
HOME* 0.5714 +
OUTS* -0.2929 +
SIT2=1* -1.0464 +
SIT2=2* -0.7885 +
SIT2=3* -0.6092 +
SIT2=4* -0.4281 +
SIT2=5* -0.5105 +
SIT2=6* -0.2745 +
SIT2=7* -0.1106 +
INN=7* -0.0138 +
INN=8* 0.0365 +
DiffRuns* 1.4561 +
Diffruns*INN=7* -0.5995 +
Diffruns*INN=8 *-0.3652 ;

If you want the table line by line, let me know. I'll probably have to email it to you.

I would print them out line by line, but your table is 334 lines long, not counting repeated headers.


Empirical Win Probabilities (August 28, 2003)

Discussion Thread

Posted 10:17 a.m., August 31, 2003 (#12) - Alan Jordan
  It would be better if you did it. That way you can check to see if I screwed anything up. You never need to ask my permission to post something like that. I've obviously put it out for public consumption.

I thought top and bottom of the inning reflected home-field advantage. Is that wrong? Because that's what I went off of.

As for any formal comparison of the fit of these two models, data from later years should be used. The fit from mine is biased and if your model was built off of data from these years, it's probably biased as well.

My winter project is to take the 2002-2003 play by play data and see if I can make a comparison of closers based on the number of men on, outs, inning, park, home plate umpire and strength of hitting.

It will be a logistic model like this except that it will also include terms for parks, umps, and opposing teams. The dependent variable will either be runs allowed or probability of a save. The closers can then be ranked by their coefficients in the model. This was a warmup of sorts for that so thanks to you and Phil for the free data.

P.S. if anyone else wants to tackle this project feel free to steal the idea. The hard part is separating the closer from the defense since closers don't rotate teams during the season. The idea of some kind of dips adjusted model is daunting and I may just stop without separating the pitcher from the defense.


Empirical Win Probabilities (August 28, 2003)

Discussion Thread

Posted 10:09 p.m., September 3, 2003 (#15) - Alan Jordan
  O.K., it took a while to get everything straightened out, but I have been able to verify the numbers for both my estimates of WE and Phil's empirical estimates based on the raw numbers. Both check out fine by me. Given the set of scenarios here, which is different from the first, your estimates have fewer discrepancies than mine, 8,929 to 13,908.

I'm confused as to why you say that your model doesn't factor in home field advantage, yet you have home and away on your table.


Empirical Win Probabilities (August 28, 2003)

Discussion Thread

Posted 10:49 p.m., September 3, 2003 (#17) - Alan Jordan
  Got it.



Sabermetrics Crackpot Index (August 29, 2003)

Discussion Thread

Posted 11:11 a.m., September 1, 2003 (#17) - Alan Jordan
  100 points for claiming that the people who secretly run baseball will change all the rules once they understand the brilliance and the irresistible new philosophy of your ideas.

25 points for writing everything from your mother's basement.


Road Warriors (September 4, 2003)

Discussion Thread

Posted 11:24 p.m., September 4, 2003 (#7) - Alan Jordan
  I know I'm going to get myself in trouble for saying this, but here goes.

I wouldn't recommend using the Pyth or any variation of the Pyth in your formulas unless you just don't have game-level data. If all you have is seasonal total data, then go ahead, use the Pyth and ignore the rest of this message. If you have game-level data and want to use the Pyth, then remove the effects of home-field advantage, parks and starting pitchers (possibly plate umps if you want) and then use the Pyth.

The reason is that the Pyth, and any function that uses seasonal totals, is inefficient and in some cases biased when data from different distributions are mixed together. Any function that uses seasonal total data, whether it's runs scored, runs allowed, HRs, GIDPs or whatever, treats all of the runs scored etc. as if they are equal, i.e. as if they come from the same distribution. Games played in high-scoring parks such as Mile High in Colorado produce higher runs scored and allowed than games played in the Dodgers' home park. This should be adjusted for before you plug it into the Pyth. It's complicated and it requires game-level data, but it's better than just taking a run estimator with season total data and plugging the estimated runs scored and runs allowed directly into the Pyth.

Again, if all you have is season total data, go ahead, you're pretty much stuck with that.


Road Warriors (September 4, 2003)

Discussion Thread

Posted 9:04 p.m., September 5, 2003 (#9) - Alan Jordan
  The park factors are probably the biggest problem in terms of bias or imprecision (inefficiency in stat jargon) of the estimates, but adding starting pitchers to your model along with teams, opponents and home-field advantage, while a huge pain in the ass, increases the precision of the estimates and allows you to make predictions conditional on the starting pitcher, which might be a little more realistic. Of course you have to assume a rotation for each team.

For simplicity's sake, there's not much harm in ignoring the starting pitchers under the assumption that they are randomly spread throughout the season.



By The Numbers - Sept 7 (September 8, 2003)

Discussion Thread

Posted 9:20 a.m., September 9, 2003 (#3) - Alan Jordan
  What alpha are you referring to? The alpha from Skiena's formula for the probability of one jai alai player beating another, or are you talking about Cronbach's alpha for internal consistency, or some other alpha?


By The Numbers - Sept 7 (September 8, 2003)

Discussion Thread

Posted 6:32 p.m., September 9, 2003 (#10) - Alan Jordan
  Reno -

David Massey has fairly up to date game by game data for 2003 at

http://www.masseyratings.com/data/mlb.gms

Without having read the book, I can't tell you exactly how Skiena derived his alpha. However, when I do this sort of work I generally use nonlinear least squares. You can also use reweighted nonlinear least squares or, if you have the programming skill and the likelihood function at hand, maximum likelihood. I have SAS, so I can use nonlinear least squares in proc nlin. If you don't have SAS or SPSS or some stat package that can do nonlinear equations, then you have to know how to program it.

As an aside, it would appear that Skiena's formula needs to be generalized to accommodate other factors such as parks, home-field advantage and league average, not to mention multinomial outcomes.

As for BP's 3rd order win%, I wouldn't rely too heavily on anything that uses aggregated season total runs or events such as hits, because strength of schedule and park factors can't be removed. Davenport doesn't explain what he's doing for 3rd order. He explains 1st and 2nd order, and they are definitely season-aggregated.

I pretty much slammed Davenport's system in a thread called "Tigers winning percentage inflated?" at fanhome.com. In retrospect I was probably a little too harsh on him, but I was appalled that someone who I thought had access to game-level data wasn't taking advantage of it. Rob Neyer isn't any better, and I KNOW he has access to game-level data, but he still uses the Pyth with seasonally aggregated data. All the others I have seen that use event data such as hits seem to use seasonally aggregated data. The problem appears to be that there is no current, up-to-date source of game-level data with hits that people can use to build better models. Once such a source becomes available, Davenport's 1st and 2nd order winning percentage models will be truly obsolete.


Pitchers, MVP, Quality of opposing hitters (September 19, 2003)

Discussion Thread

Posted 10:33 a.m., September 22, 2003 (#9) - Alan Jordan
  Tango -
"As long as the distribution of where they pitch can be explained by random chance, then we don't need to consider the park factor. I think that the Central Limit Theorem would apply (though don't quote me on that)."

It's not a question of the central limit theorem. I've noticed that when you invoke the central limit theorem, what you usually mean is "when the sample gets large enough". I think that's what you mean here. The central limit theorem relies on the latter, but it's not the latter itself.

The question of whether park factors are necessary hinges on two things.

1. How important are the park factors - the larger they are, the more likely you need to deal with them.
2. How evenly distributed are the park appearances for the pitchers - this is where your comment comes into play. Ideally, with enough starts randomly scattered across the parks, no park adjustment would be necessary because the appearances would be approximately evenly distributed among parks. Unfortunately, this would take far more than the 30-45 starts that pitchers get. It would probably take a couple of hundred.

The idea of randomness is that with a large enough sample you get evenness of distribution, but it's much more efficient to use a non-random distribution if you want evenness. For example, have a pitcher start one game and only one game in each park. Then park factors would be unnecessary.

The best resolution of this question is to handle it on a play by play basis. That way you can factor in hitter, pitcher, park, balls in play and event.


Results of the Forecast Experiment (October 2, 2003)

Discussion Thread

Posted 11:08 a.m., October 3, 2003 (#3) - Alan Jordan
  I did paired t-tests and none of them were significant because of the small sample size. A couple were marginal, around p<.15 I think, which hints that with a larger sample some might have been significant.

I would suggest a larger sample of hitters and pitchers next year, say 40 each.

Of course even if you only do 30 total next year, the results can be combined for a larger sample size.


Injury-prone players (October 14, 2003)

Discussion Thread

Posted 6:23 p.m., October 14, 2003 (#13) - Alan Jordan
  "Do a matched-pair study. That is, you have 2 groups that are equals in terms of:
- age
- position
- body type
- performance level"

Forget about matched pairs. You have to break continuous variables into discrete levels (i.e. age becomes 18-25, 26-30, etc...), you lose cases because they can't be matched and then you have arguments over what's a pair in the first place.

Go back to

Days on DL = x + y, where
x = 31 if injury prone
y = 1.3 * (Age - 23)

and add dummy variables for positions (X is a dummy variable).
You can add variables for body types (if you have that data) and performance. Also, if you think that catchers wear out faster than other position players, you can add a slope dummy for catchers where cage=0 for all positions except catcher, where cage=age. This allows age to have a different effect for catchers on the number of days on the DL (cage can also be called an interaction between position=catcher and age). You don't have to throw out any cases unless you think there is a group that is theoretically problematic.

You can also try nonlinear transformations of age such as the square, square root, log and inverse to see if the effect of age increases/decreases per year as the players get older.

Show the t values or p values for your equation so people can tell if it's just chance. I can't imagine that a coefficient of 31 isn't significant, but what about the 1.3 for age?
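
If it helps, here's roughly what the design matrix described above (the position dummies plus the catcher slope dummy) looks like in Python/numpy - a sketch, with hypothetical inputs and "C" as a made-up label for catchers:

import numpy as np

def build_design(injury_prone, age, position):
    # injury_prone: 0/1 array; age: numeric array; position: array of strings ("C" = catcher)
    positions = sorted(set(position))[:-1]             # drop one position as the reference level
    cols = [np.ones(len(age)), injury_prone, age - 23]
    cols += [(position == p).astype(float) for p in positions]
    cols.append(np.where(position == "C", age, 0.0))   # the catcher slope dummy ("cage")
    return np.column_stack(cols)

# X = build_design(injury_prone, age, position)
# beta, *_ = np.linalg.lstsq(X, days_dl, rcond=None)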


Injury-prone players (October 14, 2003)

Discussion Thread

Posted 12:02 a.m., October 15, 2003 (#18) - Alan Jordan
  Tango- The "x"/"std error" was about 3.5 for the "31" and less than 1 for the "1.3".

They measure how many standard errors the estimate is from zero, and they are usually referred to as t values. The t distribution has a different shape depending on the number of cases you have. As your number of cases becomes larger, the t distribution becomes the z distribution.

I'm somewhat surprised that your stat program doesn't provide a significance level. If you have a table of t values and a little practice using it, you can translate your t values into significance levels or p values. If you don't have a table, there are some rules of thumb to help.

If absolute value of t is greater than 2 then p<.05
if absolute value of t is greater than 3 then p<.01
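
If your program gives you t but not p, scipy can do the conversion (a sketch; df is the residual degrees of freedom, somewhere around 97 here with 100 players and a few coefficients):

from scipy import stats

def p_from_t(t, df):
    # two-sided p value for a t statistic with df degrees of freedom
    return 2 * (1 - stats.t.cdf(abs(t), df))

print(p_from_t(2.0, 97))   # roughly .05
print(p_from_t(3.5, 97))   # well under .01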

According to what you're posting, being injury prone is significant even controlling for age at p<.01 (actually p<.001 here), while age isn't significant. Of course, if there is a curvilinear relationship between age and being on the DL, then linear regression is going to underestimate the relationship here because age is specified as a linear effect, but that's kind of piddling here because the mean difference was only one year between two groups of 50 cases.

Basically the sample here implies that injuries are more a function of individual players than age.


Injury-prone players (October 14, 2003)

Discussion Thread

Posted 12:22 a.m., October 15, 2003 (#19) - Alan Jordan
  FJM - You are mixing two different phenomena here: frequency and severity of injury. Frequency should be more predictable than severity. You don't want to treat a player with 3 different visits to the DL totalling 90 days the same as one with a single, 90-day layoff.

It wouldn't hurt to run the analysis both ways. In fact that's very often done. The researchers might present both results or summarize one in the footnotes.

It's an empirical question, but my guess is that severity and frequency are correlated. People who miss work often also tend to be out for longer periods. That's a different process from baseball, because motivation tends to push people away from work but pushes baseball players towards playing. Anyway, there is a theoretical justification for trying to predict days on the DL.

There is also another reason for doing days on the DL instead of the number of trips. The number of trips to the DL is discrete (0,1,2,3...). Since most players will have 0, some will have 1 and a smaller group will have 2, etc., it will be difficult to get a high r-square and more difficult to get significant p values because of the low amount of variance (everyone bunched towards 0) and because the dependent variable is discrete (discrete dependent variables tend to have lower r-squares). Also, the justifying assumptions of linear regression tend to break down when you have discrete dependent variables, which can cause your significance levels to be wrong. Being anal about this, I would model the number of trips to the DL with a Poisson regression, a negative binomial regression or an ordinal logistic regression and see which fits better.


Injury-prone players (October 14, 2003)

Discussion Thread

Posted 12:26 a.m., October 15, 2003 (#20) - ALan Jordan
  Jim - A related issue I'd love to see studied is whether there are any injury patterns on the team level. Do some teams have consistently more injuries than others, at least more than would be expected by chance?

I think that would be the most interesting question of all. If we had the data for several years for players and teams, we could see if teams are killing players and/or trading for known deadwood.


Anatomy of a Collapse (October 15, 2003)

Discussion Thread

Posted 5:08 p.m., October 15, 2003 (#15) - Alan Jordan
  Damn!

Good work, Tango. That's a great use for WPA that I wouldn't have thought of. Of course the fan will always be blamed, because it's easier to latch onto an anecdote than to weigh facts.

By the way what is the fan's wins above replacement fan?


Anatomy of a Collapse (October 15, 2003)

Discussion Thread

Posted 10:05 p.m., October 15, 2003 (#25) - Alan Jordan
  Craig B -
This fan was at the game, but was also -.031 wins above average for the play at the railing. So his Wins Above Replacement Fan was -.03099, I think. :)

Impressive, but can you calculate how much the Cubs should pay him not to come to a game? :)


Relevancy of the Post-season (October 16, 2003)

Discussion Thread

Posted 11:15 p.m., October 16, 2003 (#4) - Alan Jordan
  I will agree that the postseason isn't guaranteed to select the "best" team, because of the heavy role of chance in short series and the arbitrary grouping of teams (the Twins got in while the Mariners stayed home). It's easy to argue mathematically that it's not the best way, but repeated postseason success, when it happens, can be considered evidence of a quality team. It's hard to argue that the Yankees of the late 90's and early 00's weren't/aren't the best team in baseball during that period. Their postseason success exceeds the expected success that teams in the postseason would get assuming they were all equal. I could also argue that the Braves have been the best team in the National League, but not in baseball, between 1991 and now, based on their postseason success.

There are lots of games/contests where the main objective is to win and players have to deal with arbitrary conditions that don't necessarily reward the best overall performance; then we have to decide whether they won by superior skill/strategy or luck. War, poker, and presidential elections (a candidate can win the majority of votes but still lose) are a few examples. There are plenty of others.

Determining who is best or who played best is somewhat problematic, but long term success should indicate that it's probably not luck.

Yanks winning the World Series after having only won 87 games - probably luck.

Yanks winning 4 out of 7 World Series - probably not luck.


Relevancy of the Post-season (October 16, 2003)

Discussion Thread

Posted 11:16 p.m., October 17, 2003 (#11) - Alan Jordan
  Of course it's semantics. Semantics allow someone to define best as the winner of the World Series or the pennant of a league. Of course, everybody else is free to disregard that definition and use another. Not arguing with you, David; in fact I think the management of the Braves looks at it the way you do when they put a team together every year. I'm sure they compare themselves with the Yankees and feel woefully inferior despite what they say publicly. I think they would be quite happy with a losing record for the season if it were just good enough to get them the wild card and they went on to win the World Series.

Champions is a good term to distinguish the team that achieved the objective of the season from the theoretically murky and unobservable best team that would have won over an infinite set of games.

"I don't know what the "best team" means if it isn't the team that won."

If you define the best team as the one that would win the highest percentage of games in a balanced schedule of infinite games, and assuming that strength doesn't change over the course of the season, then there is still no guarantee that 160 games in an unbalanced schedule will determine who is best during a year. With a definition like that, the quality of a team is unobservable and can only be estimated by observable variables such as wins, runs allowed, runs scored, etc. In fact, with a definition like that, and considering that major league sports try to keep the talent level between teams equal, there may not even be a best team, although there are clearly groups of teams that are better than others. One of the requisites for proving a difference between teams is that a difference exists in the first place. If one team always won, then nobody would watch the games. It is the competitive balance that keeps people interested in the games.

I wouldn't pronounce the postseason irrelevant or strictly ornamental. It may not tell you the best team for the season, but the standings, computerized polls and simulations can't guarantee to do that either.


Results of the Forecast Experiment, Part 2 (October 27, 2003)

Discussion Thread

Posted 8:05 p.m., October 27, 2003 (#31) - Alan Jordan
  "The baseline forecast is very simple: take a player's last 3 years OPS or ERA. If he was born 1973 or earlier, worsen his OPS by 5% or his ERA by 10%. If he was born 1976 or later, improve his OPS by 5% or his ERA by 10%. The 1974-75 players will keep their 2000-2002 averages."

I missed that in part one. I was thinking the monkey was just last year's OPS or something. If this is a monkey, it must have its own library card and bifocals. No wonder the monkey beat more than half of the readers. This is obviously the Warren Buffett of monkeys.


Results of the Forecast Experiment, Part 2 (October 27, 2003)

Discussion Thread

Posted 9:50 p.m., October 27, 2003 (#35) - Alan Jordan
  Thanks Studes


Results of the Forecast Experiment, Part 2 (October 27, 2003)

Discussion Thread

Posted 4:56 p.m., October 29, 2003 (#69) - Alan Jordan
  Michael - "It may well be the case that naive (or sophisticated-naive for a tangotiger monkey) algorithms do really well when there is a lot of uncertainty, but when things are fairly predictable they may underperform scouting or educated guesses."

There are two kinds of uncertainty: 1. where the underlying system stays the same, but there is random noise in the data; 2. where the system itself changes.

Algorithms shine in the first type and fail miserably in the second.


Baseball Graphs - Money and Win Shares (November 28, 2003)

Discussion Thread

Posted 11:06 p.m., November 28, 2003 (#3) - Alan Jordan
  The regressing of net win shares on salary and then taking the residual seems completely unnecessary. Multiply Win Shares by $300,000 to put them in terms of dollars and then subtract salary, and you have the net value added to the team. You could also divide win shares by salary, or wins above replacement by dollars above replacement. Any of these will give you a valid version of productivity in relation to salary. If I read the article correctly, this was done so that we could evaluate GMs, but since these methods already give you productivity in relation to dollars, you're there before you do the regression.

Also, if you define value as benefit-cost or benefit/cost, you shouldn't be running regressions with value as the dependent variable and cost as the independent variable. Cost is explicitly stated in the dependent variable (value), and this regression will always produce a negative r by definition.

For example, I took this data, created a normally distributed random variable with a mean of 0 and a standard deviation of 1,000,000, and then subtracted salary from it. The correlation between this number and salary was -.95.


Baseball Graphs - Money and Win Shares (November 28, 2003)

Discussion Thread

Posted 10:06 p.m., November 29, 2003 (#8) - Alan Jordan
  I listed out three ways of doing it: the first you already did, and one was Tango's. I don't know if one is any better than the others.

The way you have defined value will by definition give you a positive correlation between win shares and value. At the same time there will also be a negative correlation between salary and value. Also, if you put win shares and salary into a regression to predict value, you should get an r-square of 1, meaning perfect prediction.
Every correlation implies a linear equation like Y=M*X+B+E.

Y is your dependent variable
M is the slope
X is the independent variable
B is the y-intercept or constant
E is the error (all omitted or mispecified variables)

in this case X is salary and E is winshares. If winshares and salary were uncorrelated (this isn't true), then M would be 1 and B would be 0.

As for how to get what you want, I suggest you take the team data that has win shares and salary and fit a logistic (or probit) regression through it. The logistic function has the nice property that the predicted value can't go below 0 or above 1 (it has to be between 0 and 1). It follows an S-like curve that is usually approximately linear between .3 and .7. This is probably what you want, because you should get progressively less increase in wins as you spend more.

I took a look at 2003 data from ESPN and correlated salary to win percentage. The logistic function only slightly outperformed the linear, so I went with the linear. I did a linear regression where win percentage was the dependent variable and salary was the independent variable. I then created a variable called GM, which is simply the residual: win percentage minus predicted win percentage.

Oakland came out on top, followed by Toronto; Florida was 3rd and Atlanta was 4th. Detroit out-sucked the NY Mets by a 9% winning-percentage margin for the title of salary misallocation champions. Here is the whole table.

Obs team gm

1 Oakland 0.12204
2 Toronto 0.09450
3 SanFranc 0.08703
4 Florida 0.08588
5 Atlanta 0.07297
6 Minnesot 0.05571
7 KansasCi 0.05047
8 Montreal 0.04649
9 Seattle 0.04299
10 Houston 0.03818
11 Boston 0.03366
12 ChicagoS 0.02888
13 Philadel 0.02650
14 ChicagoC 0.01347
15 Arizona 0.00585
16 Pittsbur 0.00340
17 St.Louis -0.00113
18 Milwauke -0.01708
19 NYYankee -0.01852
20 Anaheim -0.02613
21 Colorado -0.03049
22 TampaBay -0.03357
23 Clevelan -0.04832
24 LosAngel -0.05128
25 Cincinna -0.05947
26 Baltimor -0.06183
27 SanDiego -0.06843
28 Texas -0.07044
29 NYMets -0.11967
30 Detroit -0.20166

I don't fully trust the salary data from ESPN for a couple of reasons. First, it was opening day data, so if a player was traded, his salary was attributed completely to his first team, which probably underrepresents the payroll of teams like the Yankees. Second, Mike Hampton's 12 mil salary was attributed entirely to Atlanta even though Colorado and Florida were paying most of it this year. So who knows how accurate it is.

Anyway, you can take your team win share and salary data and do the same thing. If you post it on your site or at fanhome, I'll run it for you.
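
The calculation itself is tiny - something like this in Python/numpy (a sketch; salary and win_pct are arrays with one entry per team):

import numpy as np

def gm_residuals(salary, win_pct):
    # fit win% = b0 + b1*salary by least squares and return the residuals
    X = np.column_stack([np.ones(len(salary)), salary])
    beta, *_ = np.linalg.lstsq(X, win_pct, rcond=None)
    return win_pct - X @ beta

# positive residual = more wins than the payroll predicts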


Baseball Graphs - Money and Win Shares (November 28, 2003)

Discussion Thread

Posted 10:19 a.m., November 30, 2003 (#11) - Alan Jordan
  "Cost is explicitly stated in the dependent variable (value), and this regression will always produce a negative r by definition."

That's wrong. That should read:

"Cost is explicitly stated in the dependent variable (value), and this regression will always produce a MORE negative r by definition."

A negative r is only guaranteed when benefit and cost are uncorrelated, and that's an extremely abnormal scenario.

Actually your regression of value on salary may have a use as a test of market efficiency. Let me think on this.


Baseball Graphs - Money and Win Shares (November 28, 2003)

Discussion Thread

Posted 10:54 p.m., November 30, 2003 (#12) - Alan Jordan
  O.K. If the correlation between value (productivity - cost) and cost is negative, then that should mean that people are on average overpaying for productivity. If it's positive, then people are on average underpaying. If it's 0, then people are paying the right price.

Statisticians will cringe at having cost on both sides of the equation, but in this case, I think it's o.k. In general avoid it if you can.

As for linking my quotes to your site, I wasn't sure where to post it anyway. You can quote/link anything I post.



Marcel, The Monkey, Forecasting System (December 1, 2003)

Discussion Thread

Posted 11:16 a.m., December 2, 2003 (#21) - Alan Jordan
  Another way of doing a weighted average is to weight each season by the inverse of the squared error of its mean (the inverse of its UNreliability). This gives the best (in terms of smallest mean squared error) estimator of the mean.

The standard error of the mean for a batting average is

sqrt(BA*(1-BA)/AB)

Where BA is batting average and AB is # of at bats

and the square of that is simply

BA*(1-BA)/AB

let's call that variance of the error or VE

If you wanted to weight two seasons, then you could weight them by the inverse of the variance of their errors. For example, if season 1 had a BA of .350 and 20 AB, while season 2 had a BA of .270 and 300 AB, then the variances would be

VE1=.350*(1-.350)/20=0.011375
W1= 1/VE1 = 1/0.011375 =87.91

VE2=.270*(1-.270)/300=0.000657
W2= 1/VE2 = 1/0.000657 = 1522.07

The weighted Mean is
WM1=(1/VE1*BA1 + 1/VE2*BA2)/(1/VE1+1/VE2)=

(1/0.011375*.350 + 1/0.000657*.270)/
(1/0.011375+1/0.000657)=

.274

Now what about weighting season 2 more than season 1?

Assume that you want to weight season 1 by 3 and season 2 by 5 (pick any weights you think appropriate). Then just modify the weights from 1/VE1 and 1/VE2 to 3/VE1 and 5/VE2. The resulting weighted mean is

WM2=(3/VE1*BA1 + 5/VE2*BA2)/(3/VE1+5/VE2)=.273

These two weighted means both factor in the number of at bats and an amount of error proportional to the batting average (a proportion). You could simplify these by dropping the BA*(1-BA) term. This would leave you with

WM3=(AB1*BA1 + AB2*BA2)/(AB1 + AB2) or

WM4=(3*AB1*BA1 + 5*AB2*BA2)/(3*AB1 + 5*AB2)

This gives a weight proportional to AB rather than to the square root of AB.
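
Here's a quick Python check of the weighted means above (plain Python, no packages needed):

ba = [0.350, 0.270]
ab = [20, 300]
ve = [b * (1 - b) / n for b, n in zip(ba, ab)]   # binomial error variances

w1 = [1 / v for v in ve]                         # equal season weights
wm1 = sum(w * b for w, b in zip(w1, ba)) / sum(w1)

w2 = [3 / ve[0], 5 / ve[1]]                      # weight season 1 by 3, season 2 by 5
wm2 = sum(w * b for w, b in zip(w2, ba)) / sum(w2)

print(round(wm1, 3), round(wm2, 3))              # 0.274 and 0.273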


Marcel, The Monkey, Forecasting System (December 1, 2003)

Discussion Thread

Posted 12:29 a.m., December 6, 2003 (#37) - Alan Jordan
  AED, Where do you get this:

AB/(1+x*AB*dy)?

The efficient (minimum variance) weight for an observation when there is heteroskedasticity (systematically unequal variances across observations) is 1/VE, where VE is the variance for that observation.

See Econometric Models & Economic forecasting 3rd ed. by Pindyck & Rubinfeld on page 149-153.

Assuming that we can use the average BA to estimate variance due to the binomial distribution, you get a weight of 1/AB.

Adding in year to year variance, x and a term for lags, dy should get you 1/(AB + x + dy). Where do you get the AB in the numerator and the 1 in the denominator?

Can you expand on this too?

"If player abilities and random errors are distributed normally, in fact, weighting in this way is exactly the same as making a probability analysis."


Marcel, The Monkey, Forecasting System (December 1, 2003)

Discussion Thread

Posted 11:15 p.m., December 7, 2003 (#40) - Alan Jordan
  AED, I screwed up in not one but two places. I got my weights and errors mixed up, and I put x+dy instead of x*dy.

If we ignore r*(1-r), then my VE would be 1/AB + x*dy. Just taking the reciprocal gets:

1/(1/AB + x*dy).

Multiplying the numerator and denominator by AB gives you:

AB/(1+AB*x*dy)

Which is exactly what you got, so you were right.

This is the part that I *really*, *really* want to talk about.

"This is pretty straightforward. Paraphrasing Bayes' theorem,...
you get:
x = (m/s^2 + sum_i(xi/Vi)) / (1/s^2+sum_i(1/Vi))"

I have had my doubts about how valid the standard way of regressing a rate like batting average to the mean is. The only way that I've ever seen it done is to take a batting average, subtract the mean for that year, and then multiply the difference by the year-to-year correlation. This has obvious problems if everyone has different numbers of ABs or PAs, but there is something more insidious that people don't notice. The validity of the approach is based on the idea that the correlation equals the variance of true abilities divided by the total variance. The proof goes something like this:

Assume
P1=mu+e1,
P2=mu+e2,
cov(mu,e1)=cov(mu,e2)=cov(e1,e2)=0,
var(e1)=var(e2)=var(e)

where P1 and P2 are performance for year 1 and year 2, and mu represents the true rate or average for the person, and e1 and e2 represent the error for year 1 and 2.

1. r = cov(P1,P2)/(std(P1)*std(P2))

2. cov(P1,P2)=cov(mu+e1,mu+e2)=
cov(mu,mu)+ cov(mu,e1) + cov(mu,e2) + cov(e1,e2)=
cov(mu,mu)=
var(mu)

3. std(P1)*std(P2)=
sqrt(var(mu) + var(e1))*sqrt(var(mu) + var(e2))=
sqrt(var(mu) + var(e))*sqrt(var(mu) + var(e))=
var(mu) + var(e)

Plugging the end results of 2 and 3 back into 1, you get:

4. r = var(mu)/(var(mu) + var(e))

which is by definition the ratio of the variance of true abilities to the total variance. The problem comes when you assume that the process has autoregressive elements to it. If you assume

P1=mu+e1,
P2=mu+u1+e2,
u1 is uncorrelated with mu, e1, and e2

where u1 represents the autoregressive component of the error, then the whole thing falls apart. Cov(P1,P2) still equals var(mu), but the denominator is:

sqrt(var(mu) + var(e)) * sqrt(var(mu) + var(e) + var(u1))

The denominator no longer equals the total variance, so as you can see we have a problem. Using the correlation to forecast isn't a problem, but estimating true ability is.
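
A quick simulation shows the size of the problem (numpy, with made-up variances):

import numpy as np

rng = np.random.default_rng(1)
n = 200000
var_mu, var_e, var_u = 4.0, 9.0, 2.0

mu = rng.normal(0, var_mu**0.5, n)
p1 = mu + rng.normal(0, var_e**0.5, n)
p2 = mu + rng.normal(0, var_u**0.5, n) + rng.normal(0, var_e**0.5, n)

print(np.corrcoef(p1, p2)[0, 1])   # about var_mu/sqrt((var_mu+var_e)*(var_mu+var_e+var_u)) = .29
print(var_mu / (var_mu + var_e))   # the true-score ratio = .31 - no longer the same thing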

Your system seems to be a valid replacement. Even though you take a few shortcuts, such as assuming a uniform rate for all players and assuming that the errors are normally distributed, it seems to do the job. I suggest you publish it here or at BTN.


Marcel, The Monkey, Forecasting System (December 1, 2003)

Discussion Thread

Posted 10:55 a.m., December 8, 2003 (#42) - Alan Jordan
  Tango

Rates and odds ratios (along with logits, which are ln(odds ratios)) are usually just different ways of saying the same thing. They each have their advantages and disadvantages in this case. AED's method could be modified to use odds ratios, but I suspect it would be more complicated.

AED's method only requires a program that can run an ARIMA (p=1). This is also called autoregression, where residuals are allowed to correlate with themselves as part of the model. It's one of the simplest forms of ARIMA. Actually, we need it to include other independent variables, which is called a transfer function.

The basic idea of a transfer function goes like this.

1. Perform a regression using a set of independent variables such as age and possibly injuries or whatever you think appropriate. Save the residuals.

2. Perform a second regression where the residuals are predicted by the residuals of the year before.

I believe the square of the regression coefficient will give you x for the weight AB/(1+AB*x*dy). The way I described is unbiased in large samples but inefficient (not the most precise). Maximum likelihood solves regressions 1 & 2 at the same time to get estimates that are efficient, but again only unbiased in large samples.
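
In numpy the two-step version looks roughly like this (a sketch - it assumes the data are already laid out as player-by-year arrays; the real thing would be one maximum-likelihood pass in an ARIMA/transfer-function routine):

import numpy as np

def two_step(rate, age):
    # rate, age: arrays of shape (n_players, n_years)
    y, x = rate.ravel(), age.ravel()
    X = np.column_stack([np.ones(len(y)), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)       # step 1: regress the rate on age
    resid = (y - X @ b).reshape(rate.shape)

    cur, prev = resid[:, 1:].ravel(), resid[:, :-1].ravel()
    phi = np.linalg.lstsq(prev.reshape(-1, 1), cur, rcond=None)[0][0]  # step 2: lag-1 regression of the residuals
    return b, phi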

I have a strong hunch that it will be a lot simpler to work with means than odds ratios. I'm not even sure what the variance of r/(1-r) is.

Also note that you and Rob Wood were assuming that the only error involved was from the binomial process. AED doesn't make that assumption. He effectively allows a modified version of true ability to move up and down each year. That's probably more realistic. Injuries, one-time learning/adjustments to a swing, and other temporary changes can't be represented by the binomial part of the error. They are probably swept under that rug, but they don't really belong there.

AED's method is probably a lot more realistic than using the common correlation coefficient.


Marcel, The Monkey, Forecasting System (December 1, 2003)

Discussion Thread

Posted 9:40 p.m., December 8, 2003 (#44) - Alan Jordan
  Tango,

Several months ago you forwarded an email to me from someone who wanted one equation for regression to the mean that would handle various lags and various ABs/PAs. This may well be it.

If this works, then it can be applied to Pinto's model so the park estimates can be regressed (assuming two or more years of data).


Marcel, The Monkey, Forecasting System (December 1, 2003)

Discussion Thread

Posted 12:22 a.m., December 10, 2003 (#47) - Alan Jordan
  AED,

What I posted above about the correlation being the ratio of true variance to total variance is based on true score theory. That's different from what you're doing. One of the problems that I didn't mention in that post is that the true score model predicts that the correlation between 2003 BA and 2002 BA is the same as the correlation between 2003 BA and 2001 BA, or for that matter 1990 BA. I can't imagine how that could possibly be the case.

You say that you don't model the random walk part. That's kind of puzzling. Just allowing errors to correlate is a form of modeling them.

BTW, are you estimating the autoregressive coefficient or are you setting it equal to 1? Strictly speaking, if the AR coefficient is anything other than 1, it's not a random walk. I've been assuming that you're estimating the AR coefficient rather than setting it to 1.

I think that this is a superior model for doing regression to the mean compared to the method currently in circulation. I'm also interested in it for uses other than simple forecasting. For example, you can do a logistic regression where the dependent variable is whether the batter gets on base. One set of independent variables is who is batting. The other set is what park it is in. The coefficients for batters give you OBA corrected for park effects in logit form. With a little manipulation they can be transformed from logits to rates conditional on a certain park, or an "average park". I didn't have a way of regressing those coefficients to the mean other than the traditional correlation coefficient. With your method, I can plug in the squares of the standard errors where you have the binomial portion. I could also factor out age first. I could do the same for park factors and pitchers. You get the idea. The point isn't forecasting, but estimating talent/ability.

Again, I recommend that you publish this as a more robust way of doing regression to the mean.


Baseball Musings: Defense Archives (December 5, 2003)

Discussion Thread

Posted 4:34 p.m., December 6, 2003 (#2) - Alan Jordan
  What's wrong with the way that he presents the intermediate data?


Baseball Musings: Defense Archives (December 5, 2003)

Discussion Thread

Posted 6:04 p.m., December 6, 2003 (#4) - Alan Jordan
  Nevermind, my misunderstanding. I liked it too.


By The Numbers, Dec 7 (December 8, 2003)

Discussion Thread

Posted 8:10 a.m., December 10, 2003 (#1) - Alan Jordan
  Check out the "Accuracy of Preseason Forecast" article. Out of 5 experts, nobody beats the monkey overall. They do well with the American League, but horribly with the National League. It would be interesting to do this with more seasons. Doesn't Diamond Mind post its preseason predictions?


By The Numbers, Dec 7 (December 8, 2003)

Discussion Thread

Posted 8:53 a.m., December 10, 2003 (#2) - Alan Jordan
  Florida's a good baseball city? I would argue that there are two dimensions to being a good baseball city. One is average attendance, controlling for wins/standings, past/present, expansion, and the 94-95 strike. The other is how sensitive attendance is to winning. In a regression you model these two parts as:

attendance = Team + Team*win + win + expansion + strike

The first two terms are what we care about and the rest are just there to be controlled for. We want the first term to be positive and the second term to be near 0.


Building the 2004 Expos (December 8, 2003)

Discussion Thread

Posted 9:28 p.m., December 8, 2003 (#4) - Alan Jordan
  Dlf,

You're absolutely right. The Expos' decision was entirely optimal - for the other 29 teams. Playing for the Expos must be like playing for the occupied France team against the Germans and Japanese, except that they probably won't shoot you for crossing home plate. Yet.


Professor who developed one of computer models for BCS speaks (December 11, 2003)

Discussion Thread

Posted 10:53 p.m., December 11, 2003 (#15) - Alan Jordan
  Massey's system does use home vs. away. There are two versions of his ratings. The one he posts on his website uses points scored and allowed. The one he uses for the BCS uses only wins and losses.

http://www.masseyratings.com/theory/massey.htm

Where most people in baseball use the Pyth or some variation, Massey has a function based on the difference of the scores divided by the sum of the scores, adjusted by two constants. He told me how he got the constants, but I'd have to dig up that email. He lists it out for football in this presentation.

http://www.masseyratings.com/theory/uttalk_files/frame.htm

The Colley Matrix, on the other hand, doesn't use home-field advantage. I asked him about that once, and he stated that because home-field advantage varied so much from place to place, it was better not to model it at all. His system is pretty simple (probably the only one you could do on a single worksheet of Excel) and doesn't really have a place for it. I'm not sure how you would add it in.

The Colley Matrix essentially solves a simultaneous set of equations to derive the rankings. The equations basically represent who beat whom, with a Bayesian prior of 1/2 thrown in. The real beauty is that you don't have to use an iterative method to solve it, because it's linear. Bradley-Terry and logistic regression both require an iterative procedure to solve them.

http://www.colleyrankings.com/matrate.pdf

There is a list of links on Massey's website that includes many other people who rank teams and players from various different sports, including our own AED (look for Dolphin).

http://www.masseyratings.com/index1.htm


Professor who developed one of computer models for BCS speaks (December 11, 2003)

Discussion Thread

Posted 9:09 a.m., December 12, 2003 (#18) - Alan Jordan
  You're right, Massey is not using home-field advantage in his BCS rankings. I don't get that.

How could factoring in home-field advantage not be worth the effort? If home-field advantage is a real effect, then not factoring it in produces biased results. Even if its effect varies from team to team, adding it to the model should produce less biased results than leaving it out.

If you added a home-field advantage equation to the Colley Matrix, would you still be able to solve it without an iterative procedure?


Request for statistical assistance (December 17, 2003)

Discussion Thread

Posted 8:51 p.m., December 17, 2003 (#5) - Alan Jordan
  If I understand this correctly, you can still do a logistic regression with the data you have. You can also use linear regression to give you an approximate answer (that will probably be pretty close).

Let each y be the proportion/rate
Let independent variable 1 be the pitcher.
Let independent variable 2 be the catcher.
Let the weight be the denominator of the rate stat.

If you use the linear regression, you can show how much of the rsquare is caused by pitchers and how much by catchers.

As long as you don't specify interaction terms in your model, you get estimates that have some degree of regression to the mean in them (in large samples anyway).

A logistic regression could be run the same way, i.e. without exact matchups.
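
In statsmodels terms the linear version above would look something like this (a sketch - the DataFrame and column names are made up):

import pandas as pd
import statsmodels.api as sm

# df: hypothetical columns rate, denom, pitcher, catcher (one row per pitcher-catcher pair)
def fit_weighted(df):
    X = pd.get_dummies(df[["pitcher", "catcher"]].astype(str), drop_first=True)
    X = sm.add_constant(X.astype(float))
    return sm.WLS(df["rate"], X, weights=df["denom"]).fit()

# res = fit_weighted(df); res.params, res.pvalues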

Your method has two problems.

1. You have no proof that it works (if it even does).
2. Nobody would understand what you're doing.


Request for statistical assistance (December 17, 2003)

Discussion Thread

Posted 1:25 p.m., December 18, 2003 (#7) - Alan Jordan
  Tango,

No, I'm not referring to the method in your catcher's article. As far as I can tell it's unbiased. The method I'm talking about is attempting to use SS errors to regress balks.

You're making the assumption that balks have the same error variance as SS errors. The error variance for these events should be proportional to the rate of events per PA. If they have different rates, then forget it. Even if they have the same rates, I think you still have to make the argument that they are equivalent. Also, there are other sources of error, such as the age of the pitcher/catcher, that might add to the variance of the error. We simply don't know what part of the total variance for balks, etc. is error variance and what part is true variance. The same is true for SS errors.

In short, while it is plausible that your method for regression to the mean might work, despite the objections that I've made, I wouldn't attempt to use it until it can be shown to work.


Request for statistical assistance (December 17, 2003)

Discussion Thread

Posted 7:49 p.m., December 18, 2003 (#10) - Alan Jordan
  First, regression to the mean using a year-to-year r works when r represents the ratio of true variance to total variance. In theory, if you knew the true variance or the error variance, you could construct the ratio, because the total variance can be measured from the data directly.

For example if the total variance for a variable was 10 and we knew that the error variance was 8, then we would get a ratio of 1/5.

Total var=err var + true var
true var=total var - err var
true var=10-8
true var =2

ratio=true var / total var
ratio=2/10
ratio=1/5

We could then use the ratio 1/5 to multiply the difference of each score from the mean.

if the mean were 50 and score1 was 100 then, using the 1/5 ratio you would get
adj score=mean+(score-mean)*1/5
adj score=50 + (100-50)*1/5
adj score=60

I don't really know of a situation where we would know the error variance. For batting average, we would know the binomial part, and we might be able to figure out the year-to-year variance.

I thought that was what you were trying to do. If not, nevermind.

"For Pitcher Balks, the observed standard deviation was 2 per 162 GP, while I expected, from a purely random distribition, to be 1 per 162 GP."

Is this right? Do you mean standard deviation or rate? A standard deviation isn't usually expressed in terms of successes per trial. Also, the observed standard deviation is usually larger than an adjusted std dev or var.
I don't get what you mean by purely random distribution - is it binomial, normal, or what?

In general, I would recommend that you go back to post #5. It will tell you the average effect that catchers have on balks and other events per PA. It will also give you a hypothesis test, and you can even figure out how much of the model's r-square is solely attributable to the catchers.

Actually, I'm not 100% sure what you're doing, but then I'm sick today.


Request for statistical assistance (December 17, 2003)

Discussion Thread

Posted 5:25 p.m., December 19, 2003 (#13) - Alan Jordan
  There's a couple of things here so let me do them in order.

1. "I could, and should, have marked the rates and standard deviation of the rates as a per play, but I put it at per 162 GP."

Std dev and var are customarily expressed simply as numbers, not in units of per play, per AB, dollars or inches. That's what I meant. A standard deviation can be expressed in the same units as a rate, but it usually isn't. A variance would have to be expressed in those units squared. It confused me, that's all.

2. "So, question 1: what is the observed standard deviation of the deltas of these 10,000 flippers? ( I guess it would help if I give you the data.)"

That's easily measured by applying the standard deviation formula to all of your deltas, where each delta appears to be the difference between your score and someone else's.

3. "question 2: what is the expected standard deviation of the deltas, assuming that only luck is expected."

Assuming everyone is using a fair coin, the expected standard deviation is
sqrt(p*(1-p)/N)*1000 = sqrt(.5*.5/1000)*1000 = 15.811388.

4. "So, my question #3 is how to get a regression equation for that?"

If the only source of error were binomial then you could calculate true var/total var as

r=(total var-err var)/total var

where err var = N*p*(1-p) if the scores are head counts (equivalently, p*(1-p)/N if the scores are proportions); the error variance has to be on the same scale as the total variance

and
adj score=mean+(score-mean)*r

In your terminology, you would say that we are regressing by 1-r, or (err var/total var). A sketch of this calculation follows at the end of this post.

5. The example in post #12 is confusing. What are the deltas? Are they the # of heads-500? Are they the # of heads-your # of heads?

If the standard deviations of those two groups are known to be 0 and .028 respectively, then you would regress them 100% and 99.999% respectively.

I don't see how you get standard deviations of 32 and 64. I guess you didn't really run this coin-flipping scenario.

If binomial error is the only source of error, then I can see how a total standard deviation of 64 and an error standard deviation of 32 give you (64^2 - 32^2)/64^2 = .75, which is close to your r of .73.

The problem with using the formula that I listed in 4 is that it only deals with binomial error, and there are other sources of error in baseball, such as learning, health, and adjustments by the opposition, that can't even be modeled the way parks, age, and opposition can. Some variables, like the plate umpire, should affect balks, strikes, and walks, but we don't even bother to factor them in even though they add some amount of error variance. The formula I listed in #4 therefore underestimates the amount you need to regress.

I don't know if this clears anything up, but assuming that binomial error is the only source of error and that it is uncorrelated with the spread of true talent, then yes, there is a formula for regression to the mean. Use it at your own risk in baseball, but it should work in coin-flipping experiments.
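
A minimal Python sketch of points 3 and 4 above, assuming each flipper's score is the number of heads in 1,000 flips (simulated here, not Tango's actual file):

import numpy as np

rng = np.random.default_rng(0)
n_flippers, n_flips, p = 10_000, 1_000, 0.5

# Every flipper has the same true p, so all spread in head counts is luck.
heads = rng.binomial(n_flips, p, size=n_flippers)

total_var = heads.var()                      # observed variance of the scores
err_var = n_flips * p * (1 - p)              # binomial (luck) variance = 250
print(err_var ** 0.5)                        # 15.81..., the expected std dev from point 3

r = max(total_var - err_var, 0) / total_var  # near 0 here: no true spread in talent
mean = heads.mean()
adj = mean + (heads - mean) * r              # everyone gets regressed almost all the way back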


Request for statistical assistance (December 17, 2003)

Discussion Thread

Posted 4:36 p.m., December 20, 2003 (#17) - Alan Jordan
  Tango,

going back to post #3, where do you get that the standard deviation of random is 1? How do you know it's 1, or is this an assumption?
I guess this also applies to pitchers' balks in post #9. I would feel a little more comfortable if I understood where you got that part.

Also, my delta was just p-.5. Since p-q is about double p-.5 when p is near .5, that explains the difference in the standard deviation of error between yours and mine.


Request for statistical assistance (December 17, 2003)

Discussion Thread

Posted 12:37 a.m., December 23, 2003 (#28) - Alan Jordan
  Simulated Data

Pitcher Catcher PA 2B

1 1 2049 182
1 2 5516 64
1 3 9770 220
1 4 4327 18
1 5 6025 728
1 6 3172 27
1 7 3187 90
2 1 9900 129
2 2 697 2
2 3 3171 7
2 4 5785 5
2 5 3970 77
2 6 4110 4
2 7 9546 37
3 1 4198 97
3 2 865 2
3 3 2926 12
3 4 3072 2
3 5 7483 217
3 6 4924 6
3 7 9204 46
4 1 2793 18
4 2 8393 6
4 3 7099 5
4 4 4957 1
4 5 5266 46
4 6 6995 5
4 7 5887 6
5 1 9416 1192
5 2 5438 110
5 3 9967 359
5 4 8649 69
5 5 9512 1742
5 6 1880 23
5 7 695 29
6 1 9892 72
6 2 8733 16
6 3 2465 3
6 4 6545 1
6 5 9491 111
6 6 1865 2
6 7 1555 5
7 1 5629 130
7 2 871 2
7 3 7082 42
7 4 6103 12
7 5 6096 258
7 6 8911 15
7 7 5313 32

Use a logistic regression to model the probability of event given the pitcher and catcher. Use dummy variables for the first 6 pitchers and catchers. If all catcher dummy variables are 0, then the catcher is #7. Ditto for pitchers.

Parameter estimates (individual pitcher and catcher dummy estimates not shown):

                           Standard        Wald
Parameter    DF  Estimate     Error   Chi-Square   Pr > ChiSq
Intercept     1   -4.9170    0.0771    4068.2959       <.0001

Type 3 tests of effects:

                     Wald
Effect    DF   Chi-Square   Pr > ChiSq
catcher    6       4129.6       <.0001
pitcher    6       4887.4       <.0001

As long as you include an intercept in the model and don't specify a coefficient for each combination of pitcher and catcher, then the estimates are regressed to the grand mean.
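
For anyone who wants to reproduce this outside SAS, here is a rough sketch in Python with statsmodels, assuming the table above has been saved to a hypothetical file simulated_2b.csv with columns pitcher, catcher, PA, doubles (doubles being the 2B column):

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("simulated_2b.csv")   # hypothetical file holding the table above

# Dummy variables for pitchers and catchers; dropping #7 makes it the reference level
X = pd.get_dummies(df[["pitcher", "catcher"]].astype("category"))
X = X.drop(columns=["pitcher_7", "catcher_7"])
X = sm.add_constant(X).astype(float)

# Binomial outcome as (events, non-events): doubles and PA - doubles
y = np.column_stack([df["doubles"], df["PA"] - df["doubles"]])

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.summary())   # the intercept should come out near the -4.92 reported above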


Request for statistical assistance (December 17, 2003)

Discussion Thread

Posted 12:50 a.m., December 23, 2003 (#29) - Alan Jordan
OK, the output above didn't post completely. I can email the details if anybody wants them. The point is that people have already figured out a way to deal with this problem. It looks like we're trying to reinvent the wheel.



A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Discussion Thread

Posted 9:22 p.m., January 6, 2004 (#9) - Alan Jordan
  AED

When you say Gaussian error function, do you mean the cumulative as in probit (normit)?

Also, how do you KNOW that the errors are normally distributed? That strikes me as more of an assumption, since they are unobservable.
Have you run some sort of specification test to assess that?

I'm not sure that the usual B-T or logistic rankings that are done in a non-Bayesian system are a fair comparison to your system, or to one that allows priors, because most of the priors that I've seen push the estimates toward the mean. I think if you ran a probit model the way they usually run the B-T or logistic model, you would get the same effect.


A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Discussion Thread

Posted 10:42 a.m., January 7, 2004 (#13) - Alan Jordan
  AED,

Frey's model uses game wins as a dependent variable and the discussion centers on his model, so that's all I'm interested in at the moment.

Given the graph that I saw
http://www.dolphinsim.com/ratings/info/predicting.html

There's nothing wrong with the specification of the cumulative normal distribution, but what would the graph look like if you used the logistic distribution, which is what the B-T model uses? The two distributions are so similar that you might get approximately the same fit.

Yes, I understand that the calculations are simpler, but I don't see that the logistic is wrong.

Also, the B-T models that I've seen don't use priors or any sort of Bayesian adjustment, so I would expect them to overfit the data. My question is whether a B-T (logistic) and a cumulative normal (probit) model that both don't use any priors would overpredict upsets, and whether, if they both used priors, either would overpredict the number of upsets.

It seems to me that Frey's model shouldn't overpredict major upsets as much as the approach I usually use (a simple logistic using dummy variables for teams and no priors). My estimates would be too extreme, and therefore the predictions would be too extreme (I don't care, because I'm focusing on rankings, not probability estimation per se). I doubt my model would improve by switching to the probit function.

If I wanted to focus on probability estimation, I would have to add some sort of prior distribution or Stein estimator or Bayesian model averaging or something to tone down the extreme predictions especially in the early part of the season.


A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Discussion Thread

Posted 12:36 a.m., January 8, 2004 (#17) - Alan Jordan
  "The line in that graph is not a "fit" to the data, it is the model prediction plotted against the data."

When you overlay actual over predicted values, you are visually inspecting goodness of fit. We could divide the data into groups and do a chi-square test. I know you know how to do tests like that, and the chi-square version is called a goodness of fit test. Why do you object to the word fit?

"Actually, for accurate rankings you should still use a prior. Otherwise you are prone to overranking teams with easier schedules."

I don't do any rankings at all until the teams have played at least 30 games minimum. That's more than two years' worth of data for a college football team. By that time even the Tigers have won at least one game and even the Yankees have lost 7, so there's no complete or quasi-separation of data points. Teams have generally played each other enough (within the league at least) that the matrix is invertible without resorting to a generalized inverse or setting a team coefficient to 0 (OK, I have to set one National League team and one American League team coefficient to 0 until interleague play). Also, the effect of scheduling (i.e., a good team does better playing against mediocre teams than against half really good and half really bad teams) doesn't hold up when winning percentages are between 25% and 75% and you have a larger sample size (I ran a Monte Carlo simulation at 100). College football has more unforgiving conditions than baseball.


A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Discussion Thread

Posted 12:47 a.m., January 8, 2004 (#18) - Alan Jordan
  "One problem that I think would make an accurate analysis of the problem more complicated than is presented is that any team that begins the playoffs is generally not really the same team that started the season."

That's a more interesting problem. Assuming we can come up with a reasonable model for calculating team strength at different points in the season, do we grade a team on its overall average throughout the season, or do we grade it on its strength at the end of the season? Both have merit.

Playoffs tend to reward the team that's strongest at the end of the season, while getting into the playoffs depends on average strength across the regular season. If teams had stable strengths across the season, then there would be no conflict. Of course we don't really believe that teams don't change strength across the season.


A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Discussion Thread

Posted 12:58 a.m., January 8, 2004 (#19) - Alan Jordan
  "I remember seeing a study that suggested that even teams playing identical schedules could have significatly different effective schedule strengths, because of random fluctuations in which opposing starters they faced."

If the problem is only the starting pitchers, then you add coefficients for starting pitchers to the model. At this point using priors becomes more important, but it's not catastrophic statistically. To rank teams, you treat each pitcher and team combination as if it were a separate team and then take a weighted average of these for each team to get a team strength.

On the other hand, injuries are harder to deal with. What if 2 or 3 star players are injured? I don't have a problem with docking that team for its poor performance, but a team that beats them shouldn't get many points either. I'm not sure it's big enough to worry about in baseball. In ranking the NFL, though, I would definitely have a Falcons team with Vick and one without Vick.


A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Discussion Thread

Posted 8:50 p.m., January 8, 2004 (#21) - Alan Jordan
  "You can't treat the pitcher/team combinations exactly as if they are different teams, since the offense is fairly constant regardless of who is pitching."

I have a set of dummy variables for each team and a set of dummy variables for each pitcher/team. Pitcher/teams that have fewer than 5 starts get grouped together by team. I have to admit this setup needs priors, because there are still quasi-separations even at the end of the season, and the matrix requires a generalized inverse (some parameters get set to 0).

"Regarding injuries, I have generally found injuries to be less significant than is commonly thought."

I agree, and I've never factored injuries into a model. I have left them out because I don't see a simple way of factoring them in without throwing subjectivity into it. If a team is playing better or worse at the beginning or end of the season, an opponent who plays them when they are better should get more points than a team that plays them when they are worse.

Whether a team actually changes strength by any appreciable amount is another question.

"(at the risk of interpreting noise as signal) you would conclude the turnaround happened during the bye, not at Vick's return."

Hypothetically yes, but I'm sure you've studied the effect of bye weeks on performance in the next and following weeks. I don't know if you found an effect for the week after the bye, but I doubt you found an effect for the second week after the bye. I did a check a couple of years ago where the dependent variable was win/loss and the independent variable was bye/no bye. I didn't find anything, but it was only one year's worth of data, so maybe the sample was too small.

Luck may be enough to explain the difference between 2-10 without Vick and 3-1 with him (Fisher's exact test gives a p-value of .063; the chi-square test is invalid because of the sample size), but the bye week isn't.
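
The Fisher's exact p-value quoted above can be checked with a couple of lines of Python (scipy), treating the records as a 2x2 table of wins and losses:

from scipy.stats import fisher_exact

# Rows: without Vick, with Vick; columns: wins, losses
table = [[2, 10],
         [3, 1]]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(round(p_value, 3))   # about 0.063, matching the figure above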


A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Discussion Thread

Posted 9:54 p.m., January 9, 2004 (#23) - Alan Jordan
Here is an interesting model. It's done for college football and is primarily intended to handle wins/losses, but the author puts forward a modification for handling margin of victory. It has priors, but still uses iteratively reweighted least squares (in this case, a penalized maximum likelihood).

AED, I'd be interested in your comments.

http://members.accesstoledo.com/measefam/paper.pdf

He uses a probit. The theoretical justification is that if the process is additive, then the central limit theorem kicks in and forces a Gaussian distribution (if I recall correctly, there is a version of the CLT that says the terms don't even have to come from the same distribution).

I think also that if Y is the product of a series of variables, then Y should be distributed in an exponential(?) distribution.

Maybe that's the difference between logistic and probit: probit assumes Y is a sum and logistic assumes that Y is a product. I don't know.


A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Discussion Thread

Posted 1:26 p.m., January 10, 2004 (#26) - Alan Jordan
I personally love the system because it fits my limitations in mathematics and programming. My knowledge of calculus and matrix algebra is weak. I can program some basic linear algebra in SAS's IML, but I'm lost the minute you begin any nonlinear optimization for maximum likelihood.

I'm particularly good at milking SAS's procs in such a way as to test or estimate stuff.

This system allows me to use proc logistic to estimate team strengths. That I can do.

* I agree about using all the football teams instead of just lumping the IAA together.

* His prior works by popping pseudo games into the data. If you can show me how to do that with your prior, I'll do it. I just don't see how to do it.

* His priors seem to me to function as a shrinkage estimator, like the Stein estimator. The strength estimates should be pushed toward 0. I find that attractive.

* I was wondering especially about his margin of victory method.

* I don't see how to do margin of victory your way in SAS. I was thinking you were using the cumulative normal. Could I use a linear model with your margin of victory?


A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Discussion Thread

Posted 8:09 p.m., January 11, 2004 (#28) - Alan Jordan
SAS is strictly a frequentist statistical program when it comes to its procedures. The procedures for general linear models allow specification of independent variables, dependent variables, weights (or counts) for observations, interaction terms, and stepwise parameters, but not priors. One of the selling points of that article is that his method spells out a way of getting frequentist software to do Bayesian estimation.

Even if his method were exactly like yours or other people's, he would still get points for translating such a system into frequentist software. It's really more of a teaching journal than a cutting-edge statistical journal, and it is consequently more readable for people with a lower math background, like me.

The whole idea of using penalized maximum likelihood or least squares to do Bayesian estimation only works if you know how to add an augmented pseudo data set of a correct or plausible form.

His way (an augmented pseudo data set) is pretty simple for wins/losses, but I'm not sure how to do it for margin of victory. I understand that his way for wins/losses translates to a beta distribution with mean a/(a+b), where a and b are the pseudo wins and losses that are added to the augmented data matrix.


A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Discussion Thread

Posted 9:21 p.m., January 12, 2004 (#30) - Alan Jordan
Phi(x) can be raised to any arbitrary power. Alpha-1 represents a win added to the data matrix and beta-1 represents a loss. If t is the number of teams, then you need a 2t x t matrix to add to the bottom of the data matrix.

Mease acts as if alpha-1 and beta-1 must be expressed in whole numbers. This isn't necessary. You can assign weights of any rational value.

Where he suggests that ties be entered once and wins and losses be entered twice, one can represent ties by adding a win and a loss for that combination of teams and giving them a weight of 1/2. I don't see how he missed that. The two biggest stat packages (and countless others) allow for weights.

Phi^31 would be represented by adding 30 wins and 30 losses to the data matrix for each combination of teams. Again, only a 2t x t matrix needs to be entered, with the pseudo games given a weight of 30.

Basically alpha-1 and beta-1 are the weights.
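
Here is a minimal sketch of how that 2t x t block of weighted pseudo-games might be built, assuming t teams coded as dummy columns, a reference (all-zero) pseudo-opponent, and some logistic routine that accepts per-row weights (the team count and alpha are illustrative, not Mease's actual values):

import numpy as np

t = 5                      # number of teams (illustrative)
alpha = beta = 31.0        # so alpha - 1 = beta - 1 = 30 pseudo wins/losses per team

# One pseudo win and one pseudo loss per team against the reference (all-zero) opponent:
# a 2t x t block of dummy rows to append below the real design matrix.
X_pseudo = np.vstack([np.eye(t), np.eye(t)])
y_pseudo = np.concatenate([np.ones(t), np.zeros(t)])   # 1 = win, 0 = loss
w_pseudo = np.full(2 * t, alpha - 1.0)                 # each pseudo row weighted 30

# X_real, y_real, w_real would come from the actual schedule (weights of 1);
# stack them with this block and fit any weighted logistic regression.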

I think I may have found someone who has written a program in SAS's matrix language to handle the linear bayesian estimation for the margin of victory.

I wonder what the prior would mean if I just added pseudo games where each team played

2 std below
1 std below
equal
1 std above
2 std above

where each game was weighted 12 apiece (to add up to 60, as in the wins/losses model).

I wonder if this would be a valid penalized least squares solution for baseball.



Obscure Rule Flags Students Who Sharply Improve SAT Scores (January 21, 2004)

Discussion Thread

Posted 9:11 p.m., January 22, 2004 (#22) - Alan Jordan
One of the problems with identifying people as cheaters is what's commonly called the false positive rate: if a test labels someone as a cheater, what are the odds that the person is not a cheater? This is a real problem when you are trying to identify rare events.

Let's say hypothetically that I have a test that identifies cheaters. It correctly classifies cheaters as cheaters 99% of the time and correctly classifies non-cheaters as non-cheaters 99% of the time. What are the odds that a person who is classified as a cheater is actually a cheater?

You need Bayes theorem and an estimate of what percentage are cheating.

If you assume that 1% of test takers cheat, then with the assumptions that I listed above, the odds are 50% that the person classified by the test is not a child molester.

Suppose that its ability to correctly classify cheaters or non-cheaters were less than .99. Then, with an estimated 1% of cheaters, it would actually be more likely that a person classified as a cheater was NOT a cheater.


Obscure Rule Flags Students Who Sharply Improve SAT Scores (January 21, 2004)

Discussion Thread

Posted 10:12 a.m., January 23, 2004 (#26) - Alan Jordan
  Michael,

"Yeah, but Alan Jordan where did you get the "child molester" part? I don't think you mean cheater == child molester"

Yes "Child Molester" should have been "cheater".

"At school we actually had some pretty nifty cheat detection programs in computer science where it would test student's computer program submissions and find numerous cases where students had copied other student's work (even from previous years)."

What were the sensitivities and specificities for this and what was the estimated percentage of cheaters?

The answer is A.

Confused,

Quite a screwup, wasn't it?


Obscure Rule Flags Students Who Sharply Improve SAT Scores (January 21, 2004)

Discussion Thread

Posted 2:16 p.m., January 24, 2004 (#31) - Alan Jordan
Actually, I figured out how to do that problem a couple of years ago without realizing I was using Bayes' rule or theorem. I probably didn't even know what it was at the time.

1. Start off with the number of green and yellow taxis. In other problems where they give a percentage (prevalence) instead of counts, you can arbitrarily pick a number for the total, like 100 or 1,000, and then multiply your prevalence by your arbitrary total.

2. Figure out how many greens (positives) are correctly classified. We have only 5 green taxis and 80% of 5 is 4.

3. Figure out how many yellows are incorrectly classified. We have 95 yellows and 20% of 95 is 19.

4. Divide the number of Correctly classified greens by the sum of correctly classified greens and incorrectly classified yellows. Remember that both incorrectly classified yellows and correctly classified greens will be the ones identified as greens. As J. Cross pointed out above, 4/(4+19) is .174 or 17.4%

In this example the percentage of greens correctly classified is equal to the percentage of yellows correctly classified. In most situations that isn't true. If you are trying to figure out whether a person will get into college based on their SAT, then there are literally 1,599 cut points that you could use to group people into high or low. The higher you pick your cut point, the better the correct classification for the high group, but the worse the classification for the low group. Because of that, there are two terms for the correct classification rate, depending on whether it is a positive or a negative.

Sensitivity - the percent of positives (green taxis) correctly classified.

Specificity - the percent of negatives (yellow taxis) correctly classified.

The probability that a subject classified as a positive is actually a positive is

prevalence*sensitivity / (prevalence*sensitivity + [1-prevalence]*[1-specificity])

where prevalence is the percentage of positives (greens).
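
A quick Python check of that formula against the two examples in this thread:

def positive_predictive_value(prevalence, sensitivity, specificity):
    # P(actually positive | classified positive), by Bayes' rule
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Taxi example: 5% green, 80% correct on both colors -> about 17.4%
print(positive_predictive_value(0.05, 0.80, 0.80))   # 0.1739...
# Cheating example: 1% cheaters, 99% correct both ways -> 50%
print(positive_predictive_value(0.01, 0.99, 0.99))   # 0.5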



Clutch Hitters (January 27, 2004)

Discussion Thread

Posted 9:30 p.m., January 27, 2004 (#3) - Alan Jordan
  I'm sure that everyone else reading this could answer this question, but what is LI?


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 10:53 a.m., February 3, 2004 (#10) - Alan Jordan
  It's a very interesting study. It's the first evidence that I've seen that clutch hitting exists.

This may be semantics, but I'm not ready to call someone a choker because his OBA is lower in clutch situations. I would feel more comfortable using the word choker for someone whose OPS is lower in clutch situations. If a batter is swinging for the fences, he's obviously choosing to sacrifice his probability of getting on base in order to increase his probability of hitting a home run. Since we aren't measuring his odds of hitting a home run, we are getting an incomplete measure of his contribution at the plate. What we are measuring may just be the tendency of sluggers to swing for the fences.

Perhaps we should be looking at whether a runner crossed the plate, or at the number of runners who crossed the plate versus the expected number of runners who cross the plate. I don't see a simple way of testing that.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 10:06 a.m., February 4, 2004 (#41) - Alan Jordan
I'm with David on this. I'm willing to accept as "Steel Balls" a player who hits BETTER than his non-clutch average, controlling for pitching. What we have evidence for so far is that hitters hit worse in clutch situations and some don't hit as badly (some may hit better). It may actually be that nobody hits better in the clutch than in non-clutch situations.

I think AED has found something, I'm just not sure what yet.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 8:58 a.m., February 9, 2004 (#93) - Alan Jordan
  Tango,
I don't know if you got my email or not, but I analyzed the data that you put up for OBA by the five categories of LI. I used logistic regression. There simply wasn't a statistically significant effect for differences in clutch hitting. The data could be explained by differences in players' hitting ability and LI. Adding terms for players hitting differently by LI didn't add any predictive ability.

I can provide details if anyone is interested.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 12:03 p.m., February 9, 2004 (#96) - Alan Jordan
  I'll write it up tonight. Work calls.


Clutch Hitting: Logisitic Regression (PDF) (February 10, 2004)

Discussion Thread

Posted 2:08 p.m., February 10, 2004 (#2) - Alan Jordan
  "Actually, it represents the probability - if the null is true - that this analysis would yield the current results."

That's a much better way of stating it. I wrote it up pretty quickly last night, and I wouldn't be surprised if there are whole words missing in places.

"However, this does not refute the possibility that there are a smaller subset of players who do perform better in clutch situations."

No, it doesn't. There may be a small subset that is affected, or there may just be a really small effect. Testing for a small subset is problematic because, statistically, it's cheating to look at the results to identify who has the biggest differences in the clutch and then select those batters out for analysis. You could use one year's worth of data to select and another year's worth of data to test. Actually, you could divide your data into two groups any number of ways, such as odd days versus even days. The trick is to select on one group and test on the other. My gut feeling is that if there were something there, half a million cases would have found it already.

"Also, I admit I didn't read through the past threads on clutch ability, so I may have missed this, but wouldn't clutch hitting (hitting in high leverage situations) be affected by clutch pitching (pitching in high leverage situations)?"

Perhaps in this data set, because we are only looking at it from the batter's perspective. If you had data at the PA level, then you could control for that by factoring in who was pitching and adding terms for clutch pitching. I wasn't able to control for that in this data set.


Clutch Hitting: Logisitic Regression (PDF) (February 10, 2004)

Discussion Thread

Posted 2:28 p.m., February 10, 2004 (#4) - Alan Jordan
  Tango, you asked why LI groups have to be factored into the logistic regression.

Excerpt from email:

"What I mean to suggest is that you don't have the
extra term, because you can simply normalize each LI
against the league.

Let's say we have:
Giambi: .400,.410,.420,.430,.520
league: .340,.350,.360,.370,.380

why not have this as:
Giambi,LI0,+.06
Giambi,LI1,+.06
Giambi,LI2,+.06
Giambi,LI3,+.06
Giambi,LI4,+.14

(and do this for all players), and run the regression
based on playerid and LI only?"

The main reason is that we already have a way of factoring in LI through our function prob(Y given X) = exp(X)/(1+exp(X)). By trying to adjust the data in the way you are talking about, you are treating the probability of Y as if it were linear with respect to LI, but logistic with respect to batters and clutch hitting. I wouldn't treat probabilities as linear unless I absolutely had to, because it's a misspecification.

The other reason is that I need to work with events (times on base) and chances (PA). These have to be positive, because you can't get on base -5 times out of 20. You posted OBAs and PAs. I calculated the number of times on base as round(OBA*PA). I wouldn't have even been able to run a linear regression using the method that you describe. In order to do a linear regression (actually an ANOVA), I would need to have the variance for each group of PAs and the variance of all the data combined in one group (I think that's sufficient). I might actually need to have all the data PA by PA. That's why I can't analyze your lwts data in the same file.


Clutch Hitting: Logisitic Regression (PDF) (February 10, 2004)

Discussion Thread

Posted 4:41 p.m., February 10, 2004 (#6) - Alan Jordan
  "Well, hold on a minute... As I noted in post #104 of the clutch hitting thread, I found a chi^2 of 1.04 for Tango's data, with random S.D. of 0.077 in chi^2. This gives a 30% likelihood that these data (or less consistent data) could have been produced without a clutch factor. You find a 31% chance. Therefore you are confirming my analysis of Tango's data, not contradicting it."

I have to admit that I missed that post. My understanding from post #43 was that Tango, at least, was convinced that an effect for clutch hitting was detectable in the OBA (not OBSlwts) data from 1999-2002.

"In regards to my study, your statements are misleading. You did not determine that players do not perform differently in the clutch; rather you determined that any clutch factor was sufficiently small that it could not be definitively detected in four years' of data."

I took pains to say that I hadn't proven the nonexistence of clutch hitting.

"This says absolutely nothing about whether or not it can be detected at the 2-sigma level using 24 years' of data."

This is where I think you have a valid beef.

"Given that our techniques give the same results on Tango's data, if anything your calculations show that mine are right and thus that the results from my larger data sample (analyzed similarly) are probably also right."

We both found no effect on Tango's OBA data. That's about all we can say.

Had I caught post #104, I would have written it up differently. I agree with you now that it doesn't contradict your findings, it only fails to validate them on a smaller sample.


Clutch Hitting: Logisitic Regression (PDF) (February 10, 2004)

Discussion Thread

Posted 4:48 p.m., February 10, 2004 (#7) - Alan Jordan
  "Agreed. I really meant:

x= .400/(1-.400) all divided by .340/(1-.340)
newOBA = x/(x+1)"

That's a lot better, and it is equivalent to ln(.4/(1-.4)) - ln(.34/(1-.34)), which fits directly into the logistic function. However, when you are doing hypothesis testing, it's best to avoid doing those adjustments beforehand. By putting them in as properly specified independent variables, you avoid adding bias and imprecision (inefficiency) to the estimates and their variance-covariance matrix. You also get correct degrees of freedom for the hypothesis tests.


Clutch Hitting: Logisitic Regression (PDF) (February 10, 2004)

Discussion Thread

Posted 10:44 p.m., February 10, 2004 (#9) - Alan Jordan
  "Alan, it is customary to provide upper limits for non-detections. In other words, how large would the 'clutch effect' have to be for you to detect it? I'd guess that you're only sensitive to clutch if the standard deviation of the clutch talent distribution is 0.015 or higher. Can you quantify this more precisely?"

It's not as customary as it should be. I know how to estimate power and the necessary sample size for a single coefficient, but we are testing a group of coefficients, and I can't find a formula for that in Hosmer & Lemeshow. I took a look at using the chi-square to estimate the necessary sample size, but I'm pulling theory out of my ass to get an answer. It goes like this: chi-square is proportional to effect*sample size. Estimate the effect from chi-square/sample size and then estimate the sample size needed to reach the critical alpha for a chi-square with 1,699 DF. The results suggest that with about 2 more years' worth of data, I'll have a chi-square significant at the p<=.0001 level. I find that hard to believe.

"Actually I noted my disagreement several times (#65, #69, #104), but that thread seems to have gotten hijacked by win advancement minutae so I fully understand how things get missed..."

There seemed to be a mosquito in that thread that no one could swat.
It was tedious to wade through because of all that discussion of the net effect of a PA on offense and defense.

It was unnerving to think that your methodology wasn't working. It's somewhat of a relief to see it didn't find anything on Tango's data.

I still have doubts about clutch hitting, because even with 24 years' worth of data, you still couldn't find an effect significant at the p<=.0001 level. Twenty-four years of data is the statistical equivalent of an electron microscope: you can see effects that are too small to be of any practical value to anyone. Conceivably we could collect data for another 24 years and not replicate your findings.

That said, I just finished a test run of a logistic regression of the 1999-2002 data at the PA level. I used your criteria for clutch (6th inning or later ...). Anyway, it doesn't find clutch hitting or clutch pitching, and the average drop in OBA in clutch situations is almost, but not entirely, explained by pitching. It's a test run, and I'll probably post some of the code on Fanhome this weekend to verify that I'm doing it right before I do the final run.


Clutch Hitting: Logisitic Regression (PDF) (February 10, 2004)

Discussion Thread

Posted 12:44 p.m., February 11, 2004 (#11) - Alan Jordan
Nobody accused you of making that claim. I just think that with the massive amounts of data involved, we should hold it to a higher standard than .009. I routinely ignore effects at p < .009 at my job, and I have at most 30,000 cases to work with. I sometimes ignore effects at p < .0001 if the increase in the area under the receiver operating characteristic (ROC) curve is less than .005. When you have small sample sizes you have to be more liberal. You are probably working with 3 million cases, and that should be enough to get us p < .0001.

I have to admit that because it's a high-leverage effect, it doesn't have to be as big to affect the outcome of a game. That makes it different from other effects. Still, I question whether we can estimate the effect for a player with enough precision or accuracy to justify putting it into a model. Given that we bastardize park factors (express them as linear multiplicative factors instead of as odds ratios, among other shortcuts), I just can't see that this is big enough to warrant inclusion in a model.

The last problem I have is that OBA is a really incomplete measure of clutch hitting. To address it fully, we need to look at all the outcomes of a PA. OBA treats a walk the same as a HR. I think the linear weights is a better way of looking at it. Even better would be a multinomial model. I don't have the computer power at home to estimate that for 24 years worth of data. I would need access to a university's Unix system. It would literally take days, assuming the job didn't explode.

Enough of that. I'll make some changes to the write up so that Tango can replace the one here. It should be fixed.


Clutch Hitting: Logisitic Regression (PDF) (February 10, 2004)

Discussion Thread

Posted 2:17 p.m., February 11, 2004 (#13) - Alan Jordan
By multinomial, I mean multinomial logistic regression. In regular binomial logistic regression you have two outcomes: yes or no, on base or not on base, etc. In multinomial, you can have more than two. The basic difference is that if you have k outcomes, you need k-1 equations, and the odds ratios are defined differently. Instead of ratios being defined as p/(1-p), you have p(i=j)/p(i=k). That is, one outcome is always defined as the reference, and it's the denominator in the odds ratios. For example, if you have 3 outcomes and the probabilities are .3, .5, and .2, then one outcome gets picked to be the reference category, let's say the last outcome. So the odds ratios are:

outcome1 = p(i=1)/p(i=k) = .3/.2
outcome2 = p(i=2)/p(i=k) = .5/.2

Models are estimated using the natural log of these odds ratios as dependent variables.
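
A tiny Python illustration of those reference-category log-odds, using the .3/.5/.2 example above:

import numpy as np

p = np.array([0.3, 0.5, 0.2])          # the three outcome probabilities
ref = p[-1]                            # last outcome as the reference category

log_odds = np.log(p[:-1] / ref)        # ln(.3/.2) and ln(.5/.2), the two generalized logits
print(log_odds)

# Inverting: the reference category has log-odds 0 by definition
expo = np.exp(np.append(log_odds, 0.0))
print(expo / expo.sum())               # recovers [0.3, 0.5, 0.2]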

Tango, this should look familiar from the match up method I posted a couple of months ago. This is what I based it on.

I don't know how well the lwts would work as an approximation. By the time you combine those outcomes in linear combinations, you have to treat the result as a continuous variable. That's a standard least squares problem, which would run (relatively) quickly. Any bias introduced by treating a nonlinear relation as a linear one contaminates the results. The key question is how much, and I don't know the answer to that.


Clutch Hitting: Logisitic Regression (PDF) (February 10, 2004)

Discussion Thread

Posted 2:45 p.m., February 11, 2004 (#15) - Alan Jordan
  "For each player/LI category, right?"

Yes

"In terms of least-squares, since the BB is worth less than the HR, would you weight the ln(bb/out) less than ln(hr/out)?"

For the linear regression, you take the lwts weight for a BB and that's literally the value of the dependent variable for that PA. The same goes with HR or any other outcome. The idea is that weight represents the average runs produced by that PA.


Clutch Hitting: Logisitic Regression (PDF) (February 10, 2004)

Discussion Thread

Posted 9:06 p.m., February 11, 2004 (#18) - Alan Jordan
  "If sample size is killing you now, won't it be worse trying to measure clutch changes in triples rates?"

I prefer the term f*cking computationally prohibitive.

I have a proposal. Your method is binomial. However, I believe your method can be made multinomial in the following way. Divide outcomes into:

Singles
xtra base (doubles & triples)
HR
Strike out
Walk
Out from BIP (ground outs, fly outs, double plays, triple plays &
fielder's choice)

Use either strikeouts or out from BIP as the reference category.

Calculate a singles rate as singles/(singles + outs from BIP), then an xtrabase rate as xtrabase/(xtrabase+outs from BIP), etc...

Divide PAs into clutch and nonclutch, and do the same analysis you did with OBA. You will then end up with 5 separate variances estimated with 5 separate chi-squares (of 1 DF).

As long as these chi-squares are independent, you can add them up into one chi-square with 5 degrees of freedom. This should give you a more powerful test. It will also allow you to isolate where any effects are.
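
For that last step, combining the five independent 1-DF chi-squares into one 5-DF test is a one-liner; here is a sketch with made-up chi-square values (not real results):

from scipy.stats import chi2

# Hypothetical 1-DF chi-squares, one per event category (1B, XBH, HR, SO, BB)
per_category = [2.1, 0.4, 3.0, 0.2, 1.1]

combined = sum(per_category)                  # independent chi-squares add
p_value = chi2.sf(combined, df=len(per_category))
print(combined, p_value)                      # one overall 5-DF test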

If we can form a correlation or covariance matrix of these five effects, we could estimate a model that we could plug into lwts or BSR or something to quantify value. My nonprescribed drugs are wearing off and things become fuzzy about here.



EconPapers: Steven Levitt (February 24, 2004)

Discussion Thread

Posted 11:48 p.m., February 24, 2004 (#4) - Alan Jordan
It's actually about soccer, at the bottom, just under the heading numbered 4. What's cool is that the first reference is from John Nash of "A Beautiful Mind". It's testing Nash's equilibrium (which won him a Nobel prize in economics) in soccer.


EconPapers: Steven Levitt (February 24, 2004)

Discussion Thread

Posted 10:08 p.m., February 29, 2004 (#9) - Alan Jordan
  Joe,

Remember that most gamblers use intuition rather than complex models and inside information to make their decisions. There are some gamblers who can handicap the games better than the bookies, but as long as they are a minority, they're not a problem. It's not a contradiction for a group of gamblers to be better than the bookies while the bookies are still better than the gamblers on the whole.


More Help Requested (March 4, 2004)

Discussion Thread

Posted 9:42 p.m., March 4, 2004 (#2) - Alan Jordan
What you are seeing is called a halo effect. That is, ratings for different traits are correlated based on a general emotional preference. It's common in political polls and marketing studies (people who liked Clinton were more likely to rate him as trustworthy, and people who liked Quayle were more likely to rate him as intelligent).
It's considered an almost intractable problem. The guy in question would probably say Bernie's children were biologically possible.

Here are two other points of view on data removal.

1. Don't throw any away. You're asking for opinions, and even uninformed opinions have some merit. If they don't, then you need to survey only experts or people with a minimum competency. Also, some people might rate all players very low/high. These people's data might be removed even if their rankings correlate well with the total. Trimming data (deleting cases beyond a cutoff) and winsorizing (changing cases beyond a cutoff to the cutoff itself) can cause their own problems if not done right. They reduce sample size and response variance. You can be removing signal as well as noise by removing cases.

2. Remove cases where there is no variance within player ratings. One rule might be: if sum(std(player ratings)) <= C, then delete, where C is some constant, possibly 0. If a respondent shows very little variance between players and within players, then he is basically adding a constant to all ratings. In effect he has removed himself.

Whatever you decide to do, you should analyze the data with and without the change. If there are differences, then note them even if you focus your report on one method.


More Help Requested (March 4, 2004)

Discussion Thread

Posted 2:14 p.m., March 5, 2004 (#8) - Alan Jordan
If each row has the same score for every rating except one, and that rating differs by only one point from the rest, as in the following two examples
1 1 1 1 1 1 2
5 5 5 5 5 5 4

then the variance for that row is .143. That is incredibly low variance. The average row variance for this guy is .067, which is well below .143. Changing sum(std(player ratings))<=C to mean(var(player ratings))<=.143 (since people rated different numbers of players), this person gets cut.
You can make C any value that you feel comfortable with (.143 is really, really low). That way you have an easily coded, objective measure of crap responses. You can also add other criteria with OR statements if you want, as in the sketch below. Using Bernie's arm rating itself as a criterion for removal tends to make the ratings look like yours.
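
A minimal sketch of that rule in Python, with the two example rows above standing in for one respondent's ratings (real survey data would be loaded instead):

import numpy as np

# Rows = players rated by one respondent, columns = the trait ratings (1-5 scale)
ratings = np.array([
    [1, 1, 1, 1, 1, 1, 2],
    [5, 5, 5, 5, 5, 5, 4],
])

row_var = ratings.var(axis=1, ddof=1)   # variance within each player's row of ratings
print(row_var)                          # [0.143, 0.143]

C = 0.143
if row_var.mean() <= C:
    print("flag this respondent as a crap response")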


More Help Requested (March 4, 2004)

Discussion Thread

Posted 4:39 p.m., March 5, 2004 (#19) - Alan Jordan
  Tango,
do you need to analyze anything other than means and standard deviations for this project? Are you planning on doing regressions or ANOVAs with this data? If not, you can treat the morons and the halo effect as random error that cancels itself out in large sample sizes.


More Help Requested (March 4, 2004)

Discussion Thread

Posted 5:56 p.m., March 5, 2004 (#21) - Alan Jordan
  "So, 75% agree that he's a 1. 92.5% agree that he's a 1 or 2. 97.5% agree that he's a 1,2,3. 97.5 as 1,2,3,4."

As long as the data are unimodal (only one peak), you can use the percentage at the mode as your measure of central tendency and the percentage within 1 as a confidence interval. If the data are bimodal, then you have a problem.

"In terms of "level of agreement", what if I weight the first number as "4", the second as "3", the third as "2", and the fourth as "1". This will give me a level of agreement of: 87%."

I'm not sure what this would mean to anyone but you. Also, the level of agreement would be higher if 3 were the mode and 1 & 5 were only two away instead of 4. You could scale that, but it then becomes even harder to explain.


More Help Requested (March 4, 2004)

Discussion Thread

Posted 6:58 p.m., March 20, 2004 (#27) - Alan Jordan
I don't understand why you would have a term for the minimum number of votes in the equation. It appears that as m approaches infinity, wr approaches C, and as m approaches 0, wr approaches r. Can we get AED to give us a rationale for this equation?

Is it any better than doing a weighted average where we give the overall average a specific weight like 10, 20 or 100 votes?

I don't think I could justify a specific weight, but at least I would understand what I'm doing.
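
For what it's worth, here is a small sketch of that kind of weighted average, assuming the equation being discussed has the usual form wr = (v*R + m*C)/(v + m), where R is the movie's mean vote, v its vote count, C the overall mean (6.9 here), and m the weight given to C:

def weighted_rating(R, v, C, m):
    # Weighted average of the movie's own mean and the overall mean
    return (v * R + m * C) / (v + m)

print(weighted_rating(R=8.5, v=500, C=6.9, m=1250))    # pulled well toward 6.9
print(weighted_rating(R=8.5, v=500, C=6.9, m=0))       # m -> 0 gives back R
print(weighted_rating(R=8.5, v=500, C=6.9, m=10**9))   # huge m approaches C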


More Help Requested (March 4, 2004)

Discussion Thread

Posted 9:34 p.m., March 20, 2004 (#29) - Alan Jordan
I get how the weighting works. There are a wide variety of weighting schemes that would be defensible, whether they are Bayesian or not. What I don't get is how this is Bayesian. I didn't think Bayesian systems used cutoffs for inclusion. If you call something Bayesian, then you start off with a set of assumptions and derive an equation that will give you the appropriate answer. I'm just not smart enough to look at this and see how it's Bayesian.

I'm not saying that it's inadequate in any way.


More Help Requested (March 4, 2004)

Discussion Thread

Posted 12:51 a.m., March 21, 2004 (#31) - Alan Jordan
I think I get it. M isn't the minimum number of votes needed to get into the list; it's the minimum number of votes among those already in.
If there were a cutoff number of votes, then there could be more or fewer than 250. What if 350 movies have more than 1,250 votes - do they all go into the top *250*?
I think membership in the top 250 is based on the number of votes, and 1,250 is approximately the number of votes that #250 has. Ranking of the 250 is then based on their weighted average. They decided to give the 6.9 overall average the arbitrary (as far as I can see) weight of the number of votes of the lowest-ranked movie.

At first glance 1,250 seems way too high, but it has the effect of giving the movies with the most positive votes more of a push.


More Help Requested (March 4, 2004)

Discussion Thread

Posted 12:37 p.m., March 22, 2004 (#33) - Alan Jordan
  Maybe you're right.


Park Factors (March 18, 2004)

Discussion Thread

Posted 4:36 p.m., March 18, 2004 (#1) - Alan Jordan
  Is it me or does that link go to another discussion?


Park Factors (March 18, 2004)

Discussion Thread

Posted 11:57 p.m., March 18, 2004 (#9) - Alan Jordan
I did a linear regression on OBA (each row is a separate PA) with independent variables for the park (home team and visiting team for each park, for 58 parameters), batter handedness, and the interaction of batter handedness and park (another 58 parameters). I estimated the park factors and then redid the regression, this time adding parameters for hitters, pitchers, and the defensive team. I estimated the park factors from this second regression and compared the two sets of park factors.

The top five changes in park factors are:

CHA Home Left 0.040
ARI Vis Right 0.039
SFN Home Left -0.037
COL Home Left -0.034
BOS Vis Right 0.030

San Fran/Home/Left-handed is number three out of 116. The numbers on the left tell you how much the park factors went up when controlling for who was batting and pitching, along with the defensive team. The data are 1999-2002, and no regression to the mean has been specifically applied to the park factors.

Since it's a linear model, the park factors themselves are additive instead of multiplicative or odds-ratio based. I've been playing around with using the linear model to approximate the logistic (odds ratio), and it turns out that with baseball data the hypothesis tests and predicted values match pretty damn well, and the linear model runs sooooo much quicker.

In summary, San Fran's home lefthanded park factor seems to be inflated by .037 which supports Tango's argument to some degree.


Park Factors (March 18, 2004)

Discussion Thread

Posted 9:23 p.m., March 19, 2004 (#15) - Alan Jordan
  Thanks for the offer Tango, and I may take you up on it in a week or so.
The problem with "doing it right" is that it takes so long to run. It can take 3 or 4 hours to estimate one model that contains batters and pitchers. On DIPs models where you need to add in the defensive team on top of that, it can take over 12 hours to run. That means you can spend a week doing a single specific hypotheis or set of models.
Applying park factors on a PA level like I'm doing, requires knowing the mix of PAs by park. That can known exactly if you have the data, but can only be approximated for the future. That adds a layer of noise.
There is also the problem of players who have rates of 0 or 1. When these are present in logistic or probit models, they give infinite paramaters and contaminate the hypothesis tests. Such players can be dropped or added together with other players with limited PAs ... or you can use some kind of bayesian model with priors that will take literally weeks to estimate on a home computer. If you had asked me this summer whether using odds ratios was important, I would have said absolutely yes, but I am slowly becoming disabused of that idea. I think that the small differences in talent are mostly obscurred by the fog of chance.


Copyright notice

Comments on this page were made by person(s) with the same handle, in various comments areas, following Tangotiger © material, on Baseball Primer. All content on this page remain the sole copyright of the author of those comments.

If you are the author, and you wish to have these comments removed from this site, please send me an email (tangotiger@yahoo.com), along with (1) the URL of this page, and (2) a statement that you are in fact the author of all comments on this page, and I will promptly remove them.